Introduction to Data Analysis

  • UNIT OF ANALYSIS
  • POPULATION
  • SAMPLE
  • N & n
  • DESCRIPTIVE STATISTICS
  • INFERENTIAL STATISTICS
  • TIDY DATA
  • VARIABLES
  • DICHOTOMOUS
  • NOMINAL
  • ORDINAL
  • INTERVAL-RATIO

Unit of Analysis

Who or what is being studied?


POPULATION

All units of analysis (people, institutions, groups, etc.) in which the researcher is interested.


SAMPLE

A subset of people (or institutions, groups, etc.) selected from a population.
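
The relationship between a population and a sample can be sketched in a few lines of Python (used here only for illustration; the course software may differ). The outline lists "N & n" without defining them, so this sketch assumes the common convention that N is the population size and n is the sample size. The club and its ages are made-up data.

```python
import random

# Hypothetical population: ages of all 1,000 members of an imaginary club.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [rng.randint(18, 90) for _ in range(1000)]

# Draw a simple random sample of 50 members, without replacement.
sample = rng.sample(population, k=50)

N = len(population)  # population size (assumed convention)
n = len(sample)      # sample size (assumed convention)

print(f"N = {N}, n = {n}")
```

Every unit in the sample also belongs to the population, which is what makes the sample a subset.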

DESCRIPTIVE STATISTICS

Procedures that help us organize and describe data collected from a sample or population.
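
As a rough illustration of "organize and describe," the Python sketch below summarizes a small made-up dataset with a few common descriptive statistics. The data and the choice of summaries are assumptions for the example, not part of the course materials.

```python
import statistics

# Hypothetical data: hours of sleep reported by eight survey respondents.
hours = [6, 7, 7, 8, 5, 9, 7, 6]

summary = {
    "count": len(hours),
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "mode": statistics.mode(hours),
    "std_dev": statistics.stdev(hours),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```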


INFERENTIAL STATISTICS

Procedures for making predictions or inferences about a population using observations and analyses from a sample.
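
One way to picture inference is to estimate a population mean from a sample and attach a margin of uncertainty. The Python sketch below uses simulated commute times and a normal-approximation 95% confidence interval; both are assumptions chosen for illustration, and other inferential techniques exist.

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated commute times in minutes.
population = [rng.gauss(30, 8) for _ in range(10_000)]

# In practice, only the sample is observed.
sample = rng.sample(population, k=100)

sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval (normal approximation).
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"Approximate 95% CI: ({lower:.2f}, {upper:.2f})")
print(f"Population mean (normally unknown): {statistics.mean(population):.2f}")
```

The last line is only possible because the population is simulated; with real data, the interval is the best available statement about the population.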

Tidy Data
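
The slide gives no definition, so the sketch below assumes a widely used one: each variable is a column, each observation is a row, and each cell holds a single value. The Python example reshapes a made-up "wide" table of test scores into that layout; the student names and scores are hypothetical.

```python
# Hypothetical "wide" table: one row per student, one column per test.
wide = [
    {"student": "A", "test1": 80, "test2": 85},
    {"student": "B", "test1": 72, "test2": 90},
]

# Tidy version: one row per student-test observation,
# with "student", "test", and "score" as the variables.
tidy = [
    {"student": row["student"], "test": test, "score": row[test]}
    for row in wide
    for test in ("test1", "test2")
]

for row in tidy:
    print(row)
```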

VARIABLES

Any factor, trait, or condition that can exist in differing amounts or types.

Measurement Levels

Dichotomous (aka binary)
A variable with only two categories.

Nominal
A variable made up of categories that cannot be ordered.

Ordinal
A variable made up of ranked categories, with no systematic or measurable numeric difference between the categories.

Continuous (aka interval-ratio)
A variable with values that are ordered and expressed in the same units, so the difference between any two values is measurable.
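
The measurement level shapes which summaries make sense. The Python sketch below pairs each level with one reasonable summary (mode for categories, median for ranks, mean for interval-ratio values). That pairing is a common rule of thumb rather than something stated in these slides, and the survey responses are made up.

```python
import statistics

# Hypothetical survey responses, one list per variable.
employed = ["yes", "no", "yes", "yes"]                          # dichotomous
region = ["north", "south", "south", "west"]                    # nominal
education = ["high school", "college", "college", "graduate"]   # ordinal
age = [23, 35, 41, 29]                                          # interval-ratio

# Unordered categories: the most common category (mode) is a safe summary.
employed_mode = statistics.mode(employed)
region_mode = statistics.mode(region)

# Ordinal categories can be ranked, so a median works once the ranking
# is written down explicitly. The numeric gaps between ranks mean nothing.
rank = {"high school": 1, "college": 2, "graduate": 3}
label_for = {value: label for label, value in rank.items()}
median_rank = statistics.median_low([rank[level] for level in education])
education_median = label_for[median_rank]

# Interval-ratio values share a unit, so a mean is meaningful.
age_mean = statistics.mean(age)

print("employed mode:", employed_mode)
print("region mode:", region_mode)
print("education median:", education_median)
print("age mean:", age_mean)
```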

Learning to Code

Technology is fun!


In this course, you’re learning not just the statistical concepts but also how to produce the statistics. Analyzing data requires learning to use new technology.


Learning statistical software to analyze data can be really fun. You get to learn about real-world social problems!

Technology is challenging!

It can be frustrating.


When it feels like the technology is preventing you from getting to the course content, take a deep breath, and remember that building your technology skills is part of this course.

Replication sometimes requires researchers to use unfamiliar software on devices with their own environments and settings.

There’s even a bingo card of common errors (i.e., bugs) that new statistical programmers can expect to encounter.

Why am I making you learn something so frustrating?


Calculating the statistics by hand quickly gets cumbersome, time-consuming, and difficult.


Good social science is built on replication.

You’ll learn the statistical techniques using small samples, but the datasets used to really understand the social world typically have hundreds, thousands, or even hundreds of thousands of values.

It is impractical, and more prone to error, for scientists to replicate research by hand. Replicating statistical procedures helps catch minor coding errors, unusual decisions made by researchers, inappropriate statistical techniques, and corrupted data.

Grappling

Learning to use statistical software necessitates grappling.

“Grappling implies trying even before you fail the first time.


It’s thinking, ‘First, I’ll work with it independently. Okay, I’m really not understanding it. Let me go back to my notes. Okay, I have solved for the first part of it. Now I have the second part of it. Okay, I got the question wrong; let me try again. Maybe I can ask my peer now.’


Grappling is working hard to make sure you understand the problem fully, and then using every resource at your fingertips to solve it.”

In this course, this means you’ll put your active learning skills to use. You, not your professor or TA, will work through the problems you encounter. You will, of course, be supported and coached through the entire process. Working with statistical software becomes easier as you build your skills in troubleshooting errors. Unfortunately, it gets increasingly difficult if you rely on others to solve each issue, because the errors stack up.

Most statistical analyses get done not because the analyst is a math genius, but because the analyst persisted through a minefield of technical issues by being an excellent problem-solver.

Get comfortable with making mistakes right now. Your code is not expected to be perfect the first time. Remember, identifying and fixing errors in your own code is such an inherent part of the process that there’s even a name for it: debugging.

Coding is mostly Googling


It is a misconception that the best statistical analysts sit down at their computers and type code from memory.


Much of the process of coding is copying code from somewhere else and modifying it to fit your particular situation.

Learning to analyze data with software requires a lot of practice and attention to detail. It also requires a lot of time searching the internet for help. Learning to identify the right words and phrases in a Google search is part of building your coding skills.

When you get stuck…

…there are many options to get unstuck:

  • Review the slides. Pay very close attention to small details.
  • Try something else to see if you get a new error.
  • Use Google to search for possible answers or new explanations.
  • Watch a help video on YouTube on the topic.
  • Restart your web browser or device.
  • Try another web browser or device.
  • Ask a peer or an advanced student.
  • Start or join a weekly study group.
  • Post the question on the class discussion board.
  • Email your TA.

When none of these strategies fix the issue, it is time to ask for help.

Help in this class

Before requesting an individual meeting with a TA:

  • Spend a sufficient amount of time working on the problem on your own.
  • Ask two of your peers.
  • Post the question on the class discussion board.

When emailing:

  • Explain what troubleshooting steps you’ve already taken.
  • Report who you’ve already asked for help.

If none of these solve your problem, draft an email to your TA with detailed notes about the problem and the troubleshooting steps you’ve already taken. You might be surprised how often just writing the problem out in detail helps you find the answer on your own. Send your email if you still need assistance.

Create a trail!

Use replication principles when asking for help. The best quantitative researchers produce a trail for their code so that future researchers can replicate their analysis.

Create a reproducible example


Goal: Make someone else feel your pain!

  • Assume others know nothing about your issue.
  • Describe the steps that produce the problem so that someone else can replicate it.
  • This means clearly describing the issue and the steps you’ve already taken to solve it.
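
The points above can be sketched as a small, self-contained script that anyone can run to see the same error. The example is written in Python purely for illustration, and the income values are made up; the same structure (tiny inline data, the exact failing step, the full error message) carries over to other software.

```python
import statistics

# 1. The smallest data that still triggers the problem (hypothetical values).
incomes = ["42000", "38500", "n/a", "51000"]

# 2. The exact step that fails, with the full error message captured.
try:
    average = statistics.mean(float(value) for value in incomes)
except ValueError as error:
    error_message = f"{type(error).__name__}: {error}"
    print("What I ran: statistics.mean(float(value) for value in incomes)")
    print("What I expected: the average income")
    print("What happened:", error_message)
```

Including the data inline means the reader does not need your files to reproduce the issue.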

Good etiquette

Search for answers before posting your question.
Let me google that for you. 🙄 

Describe the problem.
“It doesn’t work” isn’t descriptive enough. 

Describe your environment.
Which operating system are you using? Which R version? Which packages? Which dataset?

Describe the solution.
Confirm whether an offered solution works. Or, if you solve it on your own, post how you solved it.
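
For the "Describe your environment" step, the details can be collected by a short script instead of typed from memory. The sketch below is a Python illustration; `describe_environment` is a hypothetical helper written for this example, and the package name passed to it is only a placeholder. In R, the built-in `sessionInfo()` function reports similar information.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect OS, language version, and package versions for a help request."""
    report = {
        "operating_system": f"{platform.system()} {platform.release()}",
        "python_version": platform.python_version(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# "pip" is only a placeholder; list the packages your analysis actually uses.
for key, value in describe_environment(["pip"]).items():
    print(f"{key}: {value}")
```

Pasting this output into a discussion board post or email answers the environment questions in one step.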